2025 Leetcode Plan

Goal

The goal is to prepare for the interview with the full text search team. The team uses Lucene as the engine.

Topics

Below is a structured approach to preparing for interview questions related to full-text search—particularly as seen in systems like Apache Lucene. First, we identify 10 core topics relevant to full-text search and indexing. Then, under each topic, we list LeetCode problems whose techniques, data structures, or patterns can be conceptually applied or adapted when thinking about full-text search, indexing, or query processing.

Plan

Relevant LeetCode Problems by Topic:

(Note: LeetCode doesn't have direct "build an inverted index" problems, but the following problems involve data structures, string manipulation, and pattern searching that mirror concepts in search and indexing.)

Inverted Index & Frequency Mapping
- Conceptual match: Creating structures to quickly find occurrences of words.
- Problems:
  - [ ] #1065. Index Pairs of a String – Finding all index pairs matching words in a dictionary simulates how an inverted index might point back to positions of terms.
  - [X] #347. Top K Frequent Elements – Although about integers, it's a canonical frequency-based problem. 2024-12-30 Top K Frequent Elements - LeetCode leetcode.com ORG file: Solution 347. Top K Frequent Elements
  - [ ] #350. Intersection of Two Arrays II – Frequency counting and intersection logic can inform how posting lists intersect in inverted indexes.
Tokenization and Normalization
- Conceptual match: Splitting and cleaning input text before searching.
- Problems:
  - [X] #819. Most Common Word – Involves tokenizing and normalizing text input by removing punctuation, lowercasing, and counting words. Solution #0819 Most Common Word
  - [ ] #151. Reverse Words in a String – Requires careful splitting and trimming, similar to text normalization.
  - [ ] #937. Reorder Data in Log Files – Parsing and classifying strings into different categories.
Ranking and Scoring
- Conceptual match: Prioritizing results by frequency or relevance.
- Problems:
  - [ ] #692. Top K Frequent Words – Selecting top terms by frequency mimics the idea of ranking documents by term frequency.
  - [ ] #819. Most Common Word – Also demonstrates frequency-based selection.
  - [ ] #451. Sort Characters By Frequency – Similar principle to ranking terms by their counts.
  - [ ] #49. Group Anagrams – Grouping and categorizing words hints at organizing documents by shared characteristics.
Query Parsing and Expansion
- Conceptual match: Interpreting user queries and potentially expanding them.
- Problems:
  - [ ] #211. Design Add and Search Words Data Structure – Supports regex-like queries (e.g., wildcard '.'), illustrating query expansion/interpretation.
  - [ ] #139. Word Break – Parsing a string into valid words is analogous to query decomposition.
  - [ ] #648. Replace Words – Uses a trie to replace words efficiently, resembling how a query might be expanded using a dictionary.
  - [ ] #139. Word Break – Segmenting a query into known words, analogous to parsing a complex query string.
Fuzzy Search and Edit Distance
- Conceptual match: Handling approximate matches.
- Problems:
  - [ ] #72. Edit Distance – Core fuzzy matching metric.
  - [ ] #161. One Edit Distance – Simplified edit distance scenario. Useful to understand fuzzy matching logic.
  - [ ] #44. Wildcard Matching – Pattern matching with wildcards, similar to fuzzy queries.
  - [ ] #676. Implement Magic Dictionary – Checks if altering one character can form a dictionary word, approximating fuzzy lookups.
Prefix Trees (Tries) and Autocompletion
- Conceptual match: Trie-based indexes are common for prefix searches and suggestions.
- Problems:
  - [ ] #208. Implement Trie (Prefix Tree) – Core structure for prefix indexing.
  - [ ] #211. Add and Search Word – Extends trie concept to handle wildcard queries.
  - [ ] #642. Design Search Autocomplete System – Autocomplete functionality using trie and frequency counts.
  - [ ] #212. Word Search II – Uses a trie to efficiently find multiple words in a grid.
  - [ ] #677. Map Sum Pairs – A trie-based approach to sum values for keys with shared prefixes.
  - [ ] #745. Prefix and Suffix Search – Advanced trie usage combining prefix and suffix queries.
Suffix Arrays / Suffix Trees and Advanced String Indexing
- Conceptual match: Data structures for fast substring queries.
- Problems:
  - [ ] #28. Implement strStr; Find the Index of the First Occurrence in a String – Basic substring search. Solutions often mention KMP or other efficient substring search methods.
  - [ ] #1062. Longest Repeating Substring – Suffix array or suffix tree approaches can solve this efficiently.
  - [ ] #30. Substring with Concatenation of All Words – Complex substring search problem mimicking multi-term indexing.
  - [ ] #1044. Longest Duplicate Substring – Often solved with suffix arrays or suffix trees, mirroring complex indexing.
  - [ ] #718. Maximum Length of Repeated Subarray – Another substring-related challenge, can be approached with advanced string structures.
N-gram Indexing
- Conceptual match: Breaking text into chunks can mirror indexing terms in multi-word sequences.
- Problems:
  - [ ] #30. Substring with Concatenation of All Words – Searching for multiple words back-to-back is analogous to detecting n-grams.
  - [ ] #472. Concatenated Words – Identifying words formed by concatenating other words (akin to multi-gram analysis).
  - [ ] #336. Palindrome Pairs – Involves complex substring checks and could be approached by indexing substrings or parts of words.
  - [ ] #758. Bold Words in String – Highlighting occurrences of words can conceptually relate to identifying n-grams within text.
Efficient Substring Search (KMP, Rabin-Karp)
- Conceptual match: Core algorithms that can inspire indexing and retrieval strategies.
- Problems:
  - [ ] #438. Find All Anagrams in a String – Sliding window pattern matching, conceptually similar to scanning indexes.
  - [ ] #459. Repeated Substring Pattern – Examines the internal structure of strings, training one's intuition on substring patterns.
Phrase Queries and Proximity Search
- Conceptual match: Finding sequences of terms close together.
- Problems:
  - [ ] #79. Word Search – Searching for a phrase (word) in a grid, akin to proximity search in a corpus.
  - [ ] #212. Word Search II – Multiple word searches; tries can handle phrase-like queries efficiently.
  - [ ] #76. Minimum Window Substring – Finding the smallest substring containing all required characters parallels proximity queries.
  - [ ] #243. Shortest Word Distance – Compute minimal distance between words, analogous to checking proximity within text.
  - [ ] #244. Shortest Word Distance II – Data structure design to quickly answer proximity queries between words.

Let's go!

Goal

Topics

10 Main Topics Related to Full-Text Search and Indexing (inspired by Apache Lucene concepts)

Plan